perf: Implement specialized aggregates for `COUNT(*)` and `COUNT(expr)` #2397

andygrove · 2025-09-15T00:45:12Z

Which issue does this PR close?

N/A

Rationale for this change

TPC-H q1 time improves from 12.55s to 11.57s (8% speedup).

What changes are included in this PR?

This PR fixes some old tech debt around the way we implemented COUNT aggregates. Originally, we tried using DataFusion's count aggregate, but it was extremely slow when integrated into Comet. We did not find the root cause; instead, we implemented COUNT(expr) as SUM(IF(expr IS NOT NULL, 1, 0)), which performed better. However, this is not efficient, because the IF expression creates intermediate arrays. For COUNT(*), which Spark translates to COUNT(1), this was even more inefficient because the literal 1 is never null, so the null check is wasted work.
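To make the difference concrete, here is a minimal sketch of the two strategies. It is not the code from this PR: the column is modeled as a slice of `Option<i64>` rather than an Arrow array, and the function names `count_via_sum_if`, `count_not_null`, and `count_rows` are hypothetical names chosen for illustration.

```rust
/// A nullable Int64 column, modeled with `Option` instead of an Arrow array
/// (an assumption made to keep this sketch dependency-free).
type Column = [Option<i64>];

/// Old approach: materialize IF(expr IS NOT NULL, 1, 0) into an
/// intermediate buffer, then SUM that buffer.
fn count_via_sum_if(column: &Column) -> i64 {
    let intermediate: Vec<i64> = column
        .iter()
        .map(|v| if v.is_some() { 1i64 } else { 0i64 })
        .collect();
    intermediate.iter().sum()
}

/// COUNT(expr): count the non-null values directly, with no
/// intermediate buffer.
fn count_not_null(column: &Column) -> i64 {
    column.iter().filter(|v| v.is_some()).count() as i64
}

/// COUNT(*) / COUNT(1): the literal is never null, so the result is
/// simply the number of rows in the batch.
fn count_rows(num_rows: usize) -> i64 {
    num_rows as i64
}
```

For a column `[Some(1), None, Some(3)]`, both `count_via_sum_if` and `count_not_null` return 2, while `count_rows(3)` returns 3 without looking at the values at all.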

- This PR implements specialized CountRows and CountNotNull aggregates that are faster and more memory efficient.
- The previous SUM approach is still used for counts with multiple arguments, such as COUNT(expr1, expr2), which translates to SUM(IF(expr1 IS NOT NULL AND expr2 IS NOT NULL, 1, 0)).
- New CometFuzzTestBase extracted from CometFuzzTestSuite.
- New CometFuzzAggregateSuite.
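The multi-argument case that stays on the SUM path can be sketched as follows. This is an illustration only, assuming two equal-length columns modeled as `Option<i64>` slices; `count_multi` is a hypothetical name, not a function from this PR.

```rust
/// COUNT(expr1, expr2) expressed as
/// SUM(IF(expr1 IS NOT NULL AND expr2 IS NOT NULL, 1, 0)).
/// A row contributes 1 only when every argument is non-null.
/// Assumes both columns have the same length; `zip` stops at the shorter one.
fn count_multi(a: &[Option<i64>], b: &[Option<i64>]) -> i64 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| if x.is_some() && y.is_some() { 1i64 } else { 0i64 })
        .sum()
}
```

A row where either argument is null contributes 0, which is why this case cannot be answered from a single column's null count alone.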

How are these changes tested?

New fuzz tests are added. In CometAggregateSuite, we did not have any tests for count(*) or for count with multiple expressions.

codecov-commenter · 2025-09-15T01:23:47Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 57.51%. Comparing base (f09f8af) to head (aa6c421).
⚠️ Report is 511 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2397      +/-   ##
============================================
+ Coverage     56.12%   57.51%   +1.39%     
- Complexity      976     1295     +319     
============================================
  Files           119      147      +28     
  Lines         11743    13469    +1726     
  Branches       2251     2352     +101     
============================================
+ Hits           6591     7747    +1156     
- Misses         4012     4457     +445     
- Partials       1140     1265     +125


andygrove · 2025-09-15T03:13:09Z

.github/workflows/pr_build_linux.yml

          - name: "fuzz"
            value: |
              org.apache.comet.CometFuzzTestSuite
+              org.apache.comet.CometFuzzAggregateSuite


I forgot to add this initially, and CI failed thanks to the recently added checks for this 😄

mbutrovich · 2025-09-15T19:43:17Z

> New CometFuzzTestBase extracted from CometFuzzTestSuite

Oh I love this.

> This PR fixes some old tech debt around the way we implemented COUNT aggregates. Originally, we tried using DataFusion's count aggregate but it was extremely slow when integrated into Comet.

Did we try if performance improved at all recently? How long ago was this?

comphead · 2025-09-15T22:12:24Z

native/spark-expr/src/agg_funcs/count_rows.rs

+        }
+
+        // Count all rows regardless of null values
+        let array = &values[0];


Perhaps use https://doc.rust-lang.org/std/vec/struct.Vec.html#method.get_unchecked ? We can avoid the bounds check if we have already checked above that values is not empty.
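A minimal sketch of the suggestion, assuming `values` is a slice that has already been checked for emptiness. The function names and the `Vec<i64>` element type are hypothetical stand-ins for the array type used in count_rows.rs.

```rust
/// Checked indexing: `values[0]` performs a bounds check and panics
/// if the slice is empty.
fn first_len_checked(values: &[Vec<i64>]) -> Option<usize> {
    if values.is_empty() {
        return None;
    }
    Some(values[0].len())
}

/// Unchecked indexing: skips the bounds check, relying on the
/// emptiness test above for safety.
fn first_len_unchecked(values: &[Vec<i64>]) -> Option<usize> {
    if values.is_empty() {
        return None;
    }
    // SAFETY: `values` is non-empty (checked above), so index 0 is in bounds.
    let array = unsafe { values.get_unchecked(0) };
    Some(array.len())
}
```

Since the emptiness check immediately precedes the access, the optimizer can often remove the bounds check from the safe version on its own, so the gain from `unsafe` here is likely small. `values.first()` is a safe alternative that expresses the same intent.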

andygrove · 2025-09-16T14:52:27Z

> > New CometFuzzTestBase extracted from CometFuzzTestSuite
>
> Oh I love this.
>
> > This PR fixes some old tech debt around the way we implemented COUNT aggregates. Originally, we tried using DataFusion's count aggregate but it was extremely slow when integrated into Comet.
>
> Did we try if performance improved at all recently? How long ago was this?

This was ~1 year ago. See #744 for more information. Count was ~10x slower than Sum at the time (but only when integrated with Comet).

andygrove · 2025-09-16T14:55:07Z

I've moved this PR to draft for now. I will split out the testing changes into a separate PR and then experiment again with using DataFusion's count.

andygrove · 2025-09-16T20:56:29Z

Closing this since #2407 is less maintenance

andygrove added 6 commits September 14, 2025 17:56

implement custom CountNotNull aggregate

3bc09b1

count rows

deb44c4

remove bad test

39379de

tests

1822113

format

c2daea8

remove println

9c6e4bb

andygrove added 6 commits September 14, 2025 20:19

fix regression

64d8c09

reinstate multi input count and add tests

1ab3203

fix

8aaa0d8

is_a

ac53707

format

3ad0b7e

add new test to CI workflow

59dcace

andygrove commented Sep 15, 2025


andygrove marked this pull request as ready for review September 15, 2025 12:06

comphead reviewed Sep 15, 2025


andygrove marked this pull request as draft September 16, 2025 14:54

andygrove mentioned this pull request Sep 16, 2025

perf: Use DataFusion's count_udaf instead of SUM(IF(expr IS NOT NULL, 1, 0)) #2407

Merged

upmerge

aa6c421

andygrove closed this Sep 16, 2025
